1 Análisis de los datos de Airbnb de la ciudad de Barcelona

1.1 Guardando la Api key de Google

[1] TRUE

1.2 Cargamos los datos de Barcelona para el análisis

1.3 Limpiamos los datos

test <- capture.output(str_trim(mll_data$neighbourhood_cleansed, side=c("both")))
mll_data$price=as.numeric(gsub("\\$","",mll_data$price))

1.4 Ahora, centremos la atencion en Barcelona y analicemos la ubicación y los precios

1.4.1 Definiremos los datos que queremos representar en el mapa: La densidad de las propiedades en Airbnb

map_data = mll_data[(mll_data$neighbourhood_cleansed == "el Raval"), ]

1.5 Obtenemos en mapa de Barcelona

MapStamen = qmap("barcelona", source = "stamen", zoom = 14, color = "bw")

MapGoogle=ggmap(
  ggmap = get_map(
    "Barcelona",
    zoom = 14, scale = "auto",
    maptype = "terrain",
    source = "google"),
  extent = "device",
  legend = "topright"
)

1.6 Ubiquemos la información de la ubicación de las propiedades en el mapa

MapGoogle + geom_point(data = map_data, aes(x = longitude, y = latitude), colour = "blue", size = 2, alpha = 0.25)

1.7 Ubiquemos la información de la ubicación de las propiedades en el mapa por tipo de habitación

MapGoogle + geom_point(data = map_data, aes(x = longitude, y = latitude), colour = "blue", size = 1, alpha = 1) +
  facet_wrap(~room_type)

1.8 Creemos un mapa que muestre las zonas de Barcelona y las intensidades, según el precio

bar_circle_size = 0.010 

MapGoogle + geom_point(data = map_data, aes(x = longitude, y = latitude), colour = "red", size = map_data$price*bar_circle_size, alpha = 0.5)

1.9 Podemos usar un mapa para mostrar las intensidades, según la ubicación:

MapGoogle +
  stat_density2d(aes(x = longitude, y = latitude, fill = ..level.., alpha = ..level..),
                 size = 1, data = map_data, geom = "polygon")

1.10 Podemos extender el análisis para el caso de Barcelona

MapGoogle=ggmap(
  ggmap = get_map(
    "Barcelona",
    zoom = 13, scale = "auto",
    maptype = "terrain",
    source = "google"),
  extent = "device",
  legend = "topright"
)

MapGoogle +
  stat_density2d(aes(x = longitude, y = latitude, fill = ..level.., alpha = ..level..),
                 size = 2, data = mll_data, geom = "polygon")

1.11 Crearemos un mapa que muestre las zonas de Barcelona y las intensidades, según el precio

mll_circle_size = 0.005 

MapGoogle + geom_point(data = mll_data, aes(x = longitude, y = latitude), colour = "red", size = mll_data$price*mll_circle_size, alpha = 0.5)

1.12 Descripción de los datos antes de ajustar un modelo de regresión lineal que prediga el precio

1.13 Examinamos la variable precios

hist(mll_data$price)

mean(mll_data$price, na.rm = TRUE)
[1] 89.62964

1.14 El precio puede diferir dependiendo de la localidad. Podemos verificarlo a continuación

                         neighbourhood_cleansed     price
1                                 Baró de Viver  22.50000
2                                      Can Baró  60.15385
3                                     Canyelles  42.50000
4                              Ciutat Meridiana  26.90000
5  Diagonal Mar i el Front Marítim del Poblenou 185.94022
6                              el Baix Guinardó  64.35659
7                                el Barri Gòtic  77.76323
8                         el Besòs i el Maresme  52.68317
9                                 el Bon Pastor  31.72222
10           el Camp d'en Grassot i Gràcia Nova  79.80480
11                   el Camp de l'Arpa del Clot  70.62202
12                                    el Carmel  94.37778
13                                      el Clot  49.39873
14                                      el Coll  45.35897
15                     el Congrés i els Indians  43.20339
16                                el Fort Pienc 111.09091
17                                  el Guinardó  74.61111
18            el Parc i la Llacuna del Poblenou  79.45082
19                                 el Poble Sec  81.41600
20                                  el Poblenou  86.47990
21                         el Putxet i el Farró  92.68500
22                                     el Raval  65.74434
23                          el Turó de la Peira  30.85000
24                                        Horta  38.63889
25                                  Hostafrancs  89.92800
26              l'Antiga Esquerra de l'Eixample 109.94456
27                               la Barceloneta  65.76154
28                                   la Bordeta  55.44348
29                       la Dreta de l'Eixample 158.32805
30                         la Font d'en Fargues  74.33333
31                        la Font de la Guatlla  61.61789
32                                 la Guineueta  64.75000
33                            la Marina de Port  83.38462
34                   la Marina del Prat Vermell 112.11111
35                   la Maternitat i Sant Ramon  50.36301
36               la Nova Esquerra de l'Eixample  78.39239
37                               la Prosperitat  35.68750
38                           la Sagrada Família  92.59495
39                                   la Sagrera  43.91304
40                                     la Salut  63.22549
41                                 la Teixonera  35.62500
42                             la Trinitat Nova  29.50000
43                            la Trinitat Vella  40.60000
44                             la Vall d'Hebron  38.50000
45                          la Verneda i la Pau  40.62903
46                            la Vila de Gràcia  85.98637
47                la Vila Olímpica del Poblenou 154.28070
48                                    les Corts  78.28636
49                                 les Roquetes  35.36364
50                              les Tres Torres 110.14706
51                                      Montbau  26.18750
52                                        Navas  38.29487
53                                    Pedralbes 105.94737
54                                        Porta  32.20513
55                      Provençals del Poblenou  62.53125
56                                  Sant Andreu  44.47458
57                                  Sant Antoni  89.98194
58                     Sant Genís dels Agudells  60.85714
59                       Sant Gervasi - Galvany 102.67732
60                   Sant Gervasi - la Bonanova  65.02326
61                     Sant Martí de Provençals  42.41667
62        Sant Pere, Santa Caterina i la Ribera  72.20000
63                                        Sants  71.81073
64                                Sants - Badal  61.76623
65                                       Sarrià  67.60606
66                                   Torre Baró  39.75000
67                                     Vallbona  30.00000
68                    Vallcarca i els Penitents  66.92381
69        Vallvidrera, el Tibidabo i les Planes 119.60000
70                                       Verdun  29.24000
71                Vilapicina i la Torre Llobeta  39.50943

1.15 O dependiendo del tipo de propiedad

table(mll_data$property_type)

            Aparthotel              Apartment      Bed and breakfast                   Boat 
                    44                  14980                    229                     48 
        Boutique hotel                  Cabin              Camper/RV Casa particular (Cuba) 
                   112                      1                      3                     43 
                  Cave                 Chalet            Condominium             Dome house 
                     2                      2                    364                      2 
                  Dorm              Farm stay            Guest suite             Guesthouse 
                     2                      2                    151                     31 
                Hostel                  Hotel                  House                   Loft 
                   137                     42                    334                    421 
                 Other     Serviced apartment             Tiny house              Townhouse 
                    92                    663                      4                     35 
                 Villa 
                    19 

1.16 Podemos examinar el precio según el tipo de propiedad, examinemos la asociacion entre precios y los evaluaciones de los consumidores

aggregate(price ~ property_type, data=mll_data, FUN=mean)
            property_type     price
1              Aparthotel 119.84091
2               Apartment  86.17323
3       Bed and breakfast  45.88053
4                    Boat 241.23256
5          Boutique hotel 220.41237
6                   Cabin  25.00000
7               Camper/RV 116.66667
8  Casa particular (Cuba)  44.80488
9                    Cave  45.00000
10                 Chalet 125.00000
11            Condominium  90.99725
12             Dome house  94.00000
13                   Dorm  65.00000
14              Farm stay  58.50000
15            Guest suite  55.04636
16             Guesthouse  52.48387
17                 Hostel  91.00000
18                  Hotel 319.68750
19                  House  76.31420
20                   Loft  84.23278
21                  Other  83.90110
22     Serviced apartment 166.95735
23             Tiny house  67.50000
24              Townhouse 116.37143
25                  Villa 219.00000
aggregate(price ~ review_scores_value, data=mll_data, FUN=mean)
  review_scores_value     price
1                   2  92.48000
2                   3  47.50000
3                   4  75.58824
4                   5  77.93103
5                   6  87.98605
6                   7 116.84419
7                   8 101.66025
8                   9  94.17351
9                  10  76.28139
cor(mll_data$price, mll_data$review_scores_value, use="pairwise.complete.obs")
[1] -0.07111042
cor(mll_data$price, mll_data$review_scores_rating, use="pairwise.complete.obs")
[1] -0.03170615
aggregate(price ~ review_scores_rating, data=mll_data, FUN=mean)
   review_scores_rating     price
1                    20  88.02439
2                    30  45.00000
3                    35 150.00000
4                    40  95.31111
5                    47  61.00000
6                    50  72.53333
7                    51  80.00000
8                    52  51.00000
9                    53 124.33333
10                   54  30.00000
11                   55  82.25000
12                   56 450.00000
13                   57  82.00000
14                   59 160.00000
15                   60  83.53416
16                   61  48.00000
17                   62  85.00000
18                   63 139.55556
19                   64 134.77778
20                   65 153.00000
21                   66  36.00000
22                   67 121.28205
23                   68  48.44444
24                   69 191.54545
25                   70  84.24691
26                   71 138.50000
27                   72 113.09524
28                   73  78.46875
29                   74  87.27273
30                   75 107.74510
31                   76 100.53333
32                   77  90.07463
33                   78 113.40580
34                   79  70.39024
35                   80  91.53138
36                   81 106.60000
37                   82 100.55752
38                   83 101.78261
39                   84  95.33065
40                   85  88.26255
41                   86 101.76667
42                   87  98.19737
43                   88  90.32972
44                   89  91.86161
45                   90  89.11321
46                   91  89.45518
47                   92 100.10476
48                   93  88.30426
49                   94  88.55067
50                   95  88.53890
51                   96  87.82254
52                   97  84.78470
53                   98  87.02399
54                   99  98.80780
55                  100  80.26882

1.17 Ajustemos un modelo lineal de regresión para predecir el precio

1.18 Crearemos la muestra de entrenamiento y de prueba

spl = sample.split(mll_data$price, 0.8)
train = subset(mll_data, spl == TRUE)
test  = subset(mll_data, spl == FALSE)

mean(train$price, na.rm = TRUE)
[1] 90.32812
mean(test$price, na.rm = TRUE)
[1] 86.81302

1.19 Ajustamos el modelo

model_lm = lm(price ~ bedrooms + review_scores_value + number_of_reviews + square_feet + longitude + latitude, data=train)
summary(model_lm)

Call:
lm(formula = price ~ bedrooms + review_scores_value + number_of_reviews + 
    square_feet + longitude + latitude, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-288.18  -48.05  -17.00   20.63  866.50 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -1.480e+04  1.959e+04  -0.755 0.450470    
bedrooms             4.521e+01  4.958e+00   9.119  < 2e-16 ***
review_scores_value -9.997e+00  7.487e+00  -1.335 0.182592    
number_of_reviews   -1.816e-01  5.301e-02  -3.426 0.000677 ***
square_feet          2.748e-02  1.056e-02   2.603 0.009593 ** 
longitude            6.345e+02  3.534e+02   1.795 0.073365 .  
latitude             3.273e+02  4.795e+02   0.683 0.495199    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 110.4 on 386 degrees of freedom
  (13733 observations deleted due to missingness)
Multiple R-squared:  0.3094,    Adjusted R-squared:  0.2987 
F-statistic: 28.82 on 6 and 386 DF,  p-value: < 2.2e-16

1.20 Calculemos el Error Cuadrado Medio

predict_lm = predict(model_lm, newdata = test)
MSE_lm = mean((predict_lm - test$price)^2, na.rm = TRUE)
MSE_lm
[1] 22315.98
data.frame( R2 = R2(predict_lm, test$price, na.rm = TRUE),
            RMSE = RMSE(predict_lm, test$price, na.rm = TRUE),
            MAE = MAE(predict_lm, test$price, na.rm = TRUE))
         R2     RMSE    MAE
1 0.1230678 149.3853 75.977

1.21 Examinemos la validación cruzada

1.22 Imputaremos los valores NA de las variables numericas. No imputaremos dichos valores en el caso de la longitud y latitud

mice_mod = mice(mll_data[, names(mll_data) %in% c("price", "bedrooms", "review_scores_value", "number_of_reviews", "square_feet" )], method = 'pmm', seed = 500)

 iter imp variable
  1   1  bedrooms  square_feet  price  review_scores_value
  1   2  bedrooms  square_feet  price  review_scores_value
  1   3  bedrooms  square_feet  price  review_scores_value
  1   4  bedrooms  square_feet  price  review_scores_value
  1   5  bedrooms  square_feet  price  review_scores_value
  2   1  bedrooms  square_feet  price  review_scores_value
  2   2  bedrooms  square_feet  price  review_scores_value
  2   3  bedrooms  square_feet  price  review_scores_value
  2   4  bedrooms  square_feet  price  review_scores_value
  2   5  bedrooms  square_feet  price  review_scores_value
  3   1  bedrooms  square_feet  price  review_scores_value
  3   2  bedrooms  square_feet  price  review_scores_value
  3   3  bedrooms  square_feet  price  review_scores_value
  3   4  bedrooms  square_feet  price  review_scores_value
  3   5  bedrooms  square_feet  price  review_scores_value
  4   1  bedrooms  square_feet  price  review_scores_value
  4   2  bedrooms  square_feet  price  review_scores_value
  4   3  bedrooms  square_feet  price  review_scores_value
  4   4  bedrooms  square_feet  price  review_scores_value
  4   5  bedrooms  square_feet  price  review_scores_value
  5   1  bedrooms  square_feet  price  review_scores_value
  5   2  bedrooms  square_feet  price  review_scores_value
  5   3  bedrooms  square_feet  price  review_scores_value
  5   4  bedrooms  square_feet  price  review_scores_value
  5   5  bedrooms  square_feet  price  review_scores_value
mice_output = complete(mice_mod)

1.23 Seleccionaremos los datos para el análisis, considerando la data con los valores imputados y las variables donde no se ha hecho tal imputacion

covar=subset(mll_data, select=c("longitude", "latitude", "cancellation_policy"))
data_reg=cbind(covar, mice_output)
set.seed(1991) 
train.control = trainControl(method = "cv", number = 10)

1.24 Entrenemos un modelo de regresion lineal

model_vc_ols = train(price ~ bedrooms + review_scores_value + number_of_reviews + square_feet + longitude + latitude + cancellation_policy, data = data_reg, method = "lm", trControl = train.control)
print(model_vc_ols)
Linear Regression 

17763 samples
    7 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 15987, 15988, 15986, 15987, 15987, 15987, ... 
Resampling results:

  RMSE     Rsquared   MAE    
  88.5438  0.2499516  50.2345

Tuning parameter 'intercept' was held constant at a value of TRUE

1.25 Veamos si modelos alternativos pueden mejorar la prediccion. En primer lugar, probemos con una regresion de arbol

model_cart = rpart(price ~ bedrooms + review_scores_value + number_of_reviews + square_feet + longitude + latitude + cancellation_policy, data=train, method = "anova", na.action = na.omit)

1.26 Calculemos el Error Cuadrado Medio

predict_cart = predict(model_cart, newdata = test)
MSE_cart = mean((predict_cart-test$price)^2, na.rm = TRUE)
MSE_cart
[1] 15765.54

1.27 Calculemos medidas alternativas, como el Error Absoluto Medio (MAE) o el R cuadrado o la Raíz de la Desviación Cuadrática Media (RMSE)

data.frame( R2 = R2(predict_cart, test$price, na.rm = TRUE),
            RMSE = RMSE(predict_cart, test$price, na.rm = TRUE),
            MAE = MAE(predict_cart, test$price, na.rm = TRUE))
         R2     RMSE      MAE
1 0.0713769 125.5609 82.91956

1.28 Consideremos la validacion cruzada

model_vc_tree = train(price ~ bedrooms + review_scores_value + number_of_reviews + square_feet + longitude + latitude + cancellation_policy, data = data_reg, method = "rpart", trControl = train.control)
print(model_vc_tree)
CART 

17763 samples
    7 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 15986, 15988, 15987, 15987, 15987, 15986, ... 
Resampling results across tuning parameters:

  cp          RMSE      Rsquared   MAE     
  0.01474493  89.46055  0.2339815  49.48721
  0.05368781  92.03803  0.1889530  51.03603
  0.17219969  97.87486  0.1547440  56.56656

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.01474493.

1.29 Podemos dibujar el arbol

prp(model_cart)

1.30 Ahora, probemos con un Random Forest Model

model_rf = randomForest(price ~ bedrooms + review_scores_value + number_of_reviews + square_feet + longitude + latitude + cancellation_policy, data=train, importance= TRUE, method = "anova", na.action = na.omit)
model_rf

Call:
 randomForest(formula = price ~ bedrooms + review_scores_value +      number_of_reviews + square_feet + longitude + latitude +      cancellation_policy, data = train, importance = TRUE, method = "anova",      na.action = na.omit) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 2

          Mean of squared residuals: 10987.91
                    % Var explained: 36.59
predict_rf = predict(model_rf, ntree=4, newdata=test)
MSE_rf = mean((predict_rf-test$price)^2,na.rm = TRUE)
MSE_rf
[1] 17121.35

1.31 Calculemos medidas alternativas, como el Error Absoluto Medio (MAE) o el R cuadrado o la Raíz de la Desviación Cuadrática Media (RMSE)

predictions_rf = predict(model_rf, test)

data.frame( R2 = R2(predict_rf, test$price, na.rm = TRUE),
            RMSE = RMSE(predict_rf, test$price, na.rm = TRUE),
            MAE = MAE(predict_rf, test$price, na.rm = TRUE))
         R2     RMSE      MAE
1 0.3506037 130.8486 64.27475

1.32 Grafico importancia de las variables

varImpPlot(model_rf, sort=TRUE, main="Importancia", n.var=5)

1.33 Tabla con la importancia de las variables

var.imp=data.frame(importance(model_rf, type=2))
var.imp$Variables=row.names(var.imp)
var.imp[order(var.imp$IncNodePurity, decreasing = TRUE),]
                    IncNodePurity           Variables
bedrooms                1415412.3            bedrooms
number_of_reviews       1047097.7   number_of_reviews
square_feet             1005390.1         square_feet
latitude                 920292.6            latitude
longitude                780822.6           longitude
review_scores_value      529607.5 review_scores_value
cancellation_policy      148936.2 cancellation_policy